How good are global Newton methods. Part 1

A.  A.  Goldstein* 

ABSTRACT: 1) Relying on a theorem of Nemerovsky and Yuden (1979), a lower bound
is given for the efficiency of global Newton methods over the class $C^1(\mu,\lambda)$ defined below.
2) The efficiency of Smale's global Newton method in a simple setting with a non-singular,
Lipschitz-continuous Jacobian is considered. The efficiency is characterized by two
parameters, the condition number $Q$ and the smoothness $S$, defined below. The efficiency is
sensitive to $S$ and insensitive to $Q$.

KEYWORDS: Global Newton methods, unconstrained optimization, computational complexity

Global Newton methods are considered by some to be methods for minimizing a "strongly"
convex function $f$ defined on a real Hilbert space $E$. Strongly convex means that $f$ is twice
differentiable with a Hessian that is bounded from above and below. By $C(\mu,\lambda)$ we denote
the set of all strongly convex functions whose Hessian is bounded below by $\mu$ and above
by $\lambda$. The Hessian is invertible, so Newton's method is well defined at every point in
$E$. Moreover, a strongly convex function achieves a minimum, where $\nabla f(x) = 0$. However,
Newton's method may not converge to a root of $\nabla f(x) = 0$ from arbitrary points in $E$.
This is a raison d'etre for the global Newton methods. These methods, whose ingredients
contain Newton steps, generate sequences that converge for every strongly convex function
and any starting point in $E$. The convergence rate is asymptotically superlinear. An
early  history  of  this  subject  may  be  found  in  Polak(1973),  who  cites  contributions  of 
Goldstein(1965),  Pshenichnyi(1970),  and  Robinson(1972).  More  recent  work  is  due  to 
Bertsekas(1982),  Dunn(1980),  Hughes  and  Dunn(1984),  and  others.  All  of  these  results 
give  estimated  asymptotic  rates  of  convergence.  Global  Newton  methods  for  finding  roots 
go  back  to  at  least  1934.  They  are  related  to  continuation  methods.  An  early  history 
and  discussion  may  be  found  in  Ortega  and  Rheinboldt(1970,p235),  who  credit  the  basic 
idea to Lahaye (1934, 1948). Current references may be found in Smale (1986-2). In general
we regard a global Newton method as any algorithm incorporating Newton steps that
generates a finite sequence terminating in an approximate root. This is a point from
which the ordinary Newton's method will converge. Other algorithms are available that
terminate in an approximate root. The efficiency, or iteration count, of two such algorithms
will be compared to a global Newton method. The word "algorithm" as used in this paper
should be taken with a grain of salt: we assume information that is not given with real
problems. Our excuse for doing this is that we hope thereby to gain insight and motivation
for the future construction of good algorithms.

* Supported by grants NIH RR01243-05 and NPS LMC-M4E1.

The  efficiency  of  a  Global  Newton  method  was  probably  first  analyzed  by  Kung(1976), 
using  natural  assumptions  that  imply  a  non-vanishing  Jacobian.  It  appears  that  the  next 
such  result  is  due  to  Smale(  1986-2)  who  established  a  global  Newton  method  in  the  general 
setting  of  an  analytic  mapping  between  Banach  spaces,  both  real  or  both  complex.  We 
revisit  this  problem  below.  Our  assumptions  are  close  to  Kung's  but  our  algorithm  follows 
Smale.  The  first  part  of  this  paper  will  show  that  the  class  of  strongly  convex  functions 
and  thus  any  more  generalized  classes  that  include  the  strongly  convex  functions  are  not  a 
suitable  setting  for  Newton's  method;  hence,  also  not  suitable  for  global  Newton  methods. 
Unfortunately,  this  is  the  setting  for  the  asymptotic  convergence  proofs  mentioned  above. 

Consider the class $C^1(\mu,\lambda)$ having the following definition. Let $F$ be a continuously
differentiable map from a separable real Hilbert space $H$ into itself. The inner product in $H$
will be denoted by $[\,\cdot\,,\cdot\,]$. Let $D(x)$ denote the Frechet derivative of $F$ at $x$. By $C^1(\mu,\lambda)$
we denote the set of all maps $F$ for which $\mu\|h\| \le \|D(x)h\| \le \lambda\|h\|$ for all $h, x \in H$,
with $\mu > 0$. Let $Q = \lambda/\mu$. Assume that the linear operator $D(x)$ has an inverse. We
shall show that no global Newton method (or any other algorithm) can do better than
linear convergence at a certain determined rate over every member of the above class. Any
algorithm that can achieve this rate is called an optimal algorithm. For the special case
of $C(\mu,\lambda)$ a simple algorithm due to Nesterov (1983) is optimal to within a multiplicative
constant. The convergence rate is linear. Nesterov's algorithm does not require inversions;
it is similar to the gradient method. Any application of Newton's method requires the
computation of an inverse operator or the solution of a system of linear equations. If the
dimension of $H$ is small we usually are willing to pay the price of solving equations to gain
the possibility of quadratic convergence. The convergence estimates for the global Newton
method in the class $C^1(\mu,\lambda,L)$, the subset of $C^1(\mu,\lambda)$ for which $D(x)$ satisfies a uniform
Lipschitz condition with constant $L$, show arbitrarily slow convergence for sufficiently large values of $L$.
For the special case when $D(x)$ is everywhere self-adjoint we exhibit a gradient algorithm
whose efficiency is insensitive to $L$. For this case the estimate for the gradient method is
superior to that of the global Newton method when $L$ is sufficiently large. However, the
gradient method is sensitive to $Q$, while the global Newton method is not. Thus for fixed
$L$ and large enough values of $Q$ the situation is reversed.

It  is  a  pleasure  to  thank  Brad  Bell  for  discussions  and  helpful  criticisms. 


REMARK 1. a) If $F \in C^1(\mu,\lambda)$ then any stationary point of $\|F(x)\|^2$ is a root of $F$.

PROOF Let $f(x) = [F(x), F(x)]$. The differential is $f'(x,h) = 2[F(x), D(x)h]$, where $D(x)h
= F'(x,h)$. Let $h = D^{-1}(x)F(x)$. If $x$ is stationary then $f'(x,h) = 0 = 2[F(x), F(x)]$,
whence $F(x) = 0$.

b) EXISTENCE. In view of a) above, if in addition $f$ has compact level sets then $F$ has
roots.

Let $C(\mu,\lambda)$ denote the set of twice differentiable convex functions with

$$\mu\|h\|^2 \le f''(x,h,h) \le \lambda\|h\|^2$$

for all $x$ and $h \in H$, and some positive $\mu \le \lambda$. The class $C(\mu,\lambda)$ is called a set of "strongly
convex" functions. The number $Q = \lambda/\mu$ is called the condition number.

ALGORITHMS By an algorithm $A(g)$ where $g \in C(\mu,\lambda)$ we mean a recurrence relation
that calculates $x_{k+1}$ using some of the values of $g$, $g'$ and $g''$ at $x_s$, $s = 0, 1, 2, \ldots, k$, with $x_0$
arbitrarily given. $A(g)$ is a special case of a "local method" defined by Nemerovsky and
Yuden, 1981. By an algorithm $B(F)$ defined on $C^1(\mu,\lambda)$, we mean a recurrence relation
that calculates $x_{k+1}$ using some of the values of $F$ and $F'$ at $x_s$, $0 \le s \le k$, with $x_0$
arbitrarily given. $B(F)$ is also an instance of a local method. We shall assume that all
global Newton methods are $B(F)$ algorithms.

Let $C_s(\mu,\lambda)$ denote the subset of $C^1(\mu,\lambda)$ for which $D(x)$ is self-adjoint with spectral bounds
$\mu$ and $\lambda$ for all $x \in H$. For $F \in C_s(\mu,\lambda)$ we can associate a "potential" function $f$ (Vainberg
1955) such that $\nabla f(x) = F(x)$ for all $x$ in $H$. (Actually, an equivalence class of functions,
differing from each other by a constant.) Also, to every $f \in C(\mu,\lambda)$ there corresponds an
$F \in C_s(\mu,\lambda)$. The function $f$ is weakly lower semi-continuous and the level sets of $f$ are
weakly sequentially compact. This, and the strong convexity of $f$, implies that there exists
a unique minimizer for $f$, say $z$.

When  any  formula  below  is  followed  by  the  word  "steps"  we  mean  that  the  formula  is  to 
be  rounded  up  to  the  nearest  integer.  We  rely  on  the  following  claim. 

THEOREM 1 (NEMEROVSKY-YUDEN (1979)) Given a positive $\epsilon < 1$, a fixed but
arbitrary point $x_0 \in H$ and an algorithm $A(f)$, there exists a function $f \in C(\mu,\lambda)$ such that
if $x_k$ generated by $A(f)$ reduces $f(x_k) - f(z)$ to less than $\epsilon\,(f(x_0) - f(z))$ then $k$ exceeds
the number:

$$c\,\frac{\min(n, \sqrt{Q})}{\ln \min(n, \sqrt{Q})}\,\ln\frac{1}{\epsilon} = R \ln\frac{1}{\epsilon} \quad \text{steps}$$

Here $c$ is a positive constant, $Q > 2$, and $z = \operatorname{argmin} f$.

REMARK 2. For $n$ and $Q$ sufficiently large and $\epsilon$ sufficiently small the above bound may
be increased to:

$$c\,\sqrt{Q}\,\ln\frac{1}{\epsilon}$$


Given a positive $\epsilon < 1$ and any function $f \in C(\mu,\lambda)$ there is an algorithm $A$ that yields
$f(x_k) - f(z) \le (f(x_0) - f(z))\,\epsilon$ whenever $k$ exceeds $4\sqrt{Q}\,(\ln 2)^{-1}\ln\epsilon^{-1} = R'\ln\epsilon^{-1}$. This
algorithm is due to Nesterov (1983). It is essentially optimal, and can only be improved by
a decrease in the constant factor $4/\ln 2$. Stated otherwise, the algorithm $A$ applied to any
function $f \in C(\mu,\lambda)$ generates a sequence $x_k$ that satisfies $(f(x_k) - f(z))/(f(x_0) - f(z)) \le
(e^{-1/R'})^k$ for $1 \le k < \infty$.

The algorithm $A$ (that will be called GRAD1 below) may also be taken to be the gradient
method with step length $1/\lambda$. This algorithm requires no information about the values
of the function $f$, while Nesterov's does. Observe that the linearly converging sequence
$(e^{-1/R})^k$, $k = 1, 2, 3, \ldots$ is for each $k$ a lower bound for the relative decrease of some
function $f$ in $C(\mu,\lambda)$ at $x_k$, while the sequence $(e^{-1/R'})^k$ is an upper bound for the
relative decrease for any function in $C(\mu,\lambda)$. For the gradient method above the sequence
is $(e^{-1/Q})^k$.
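To get a feeling for the two upper bounds just quoted, the following is a minimal sketch that converts them into iteration counts. It assumes the rate $(e^{-1/R'})^k$ with $R' = 4\sqrt{Q}/\ln 2$ for Nesterov's method and the standard rate $(e^{-1/Q})^k$ for the gradient method with step length $1/\lambda$; the function names are illustrative and do not come from the paper.

```python
import math


def nesterov_steps(Q, eps):
    """Steps k with (e^{-1/R'})^k <= eps, where R' = 4 sqrt(Q) / ln 2 (rounded up)."""
    r_prime = 4.0 * math.sqrt(Q) / math.log(2.0)
    return math.ceil(r_prime * math.log(1.0 / eps))


def gradient_steps(Q, eps):
    """Steps k with (e^{-1/Q})^k <= eps for the fixed-step gradient method (assumed rate)."""
    return math.ceil(Q * math.log(1.0 / eps))


if __name__ == "__main__":
    for Q in (4.0, 100.0, 10000.0):
        print(Q, nesterov_steps(Q, 1e-6), gradient_steps(Q, 1e-6))
```

Under these assumptions the gradient bound is the smaller one for small $Q$ (the constant $4/\ln 2$ dominates), while the $\sqrt{Q}$ dependence wins once $Q$ exceeds roughly $(4/\ln 2)^2 \approx 33$.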

This prompts us to call the class $C(\mu,\lambda)$ "esslinearly convergent", that is, every function in
the class can be made to converge no slower than linearly, but $\sup\{(f(x_k) - f(z))/(f(x_0) -
f(z)) : f \in C(\mu,\lambda)\}$, $k = 1, 2, 3, \ldots$, cannot converge faster than linearly. For brevity we shall
refer to this latter property as "sublinearly convergent". We now observe that the class
$C_s(\mu,\lambda)$ is also esslinearly convergent.

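LEMMA 1 Given $F \in C_s(\mu,\lambda)$, let $f$ denote any potential function for $F$. Let $z = \operatorname{argmin}
f$. The following inequalities obtain:

$$(2Q)^{-1}\left[\frac{f(x) - f(z)}{f(x_0) - f(z)}\right]^{1/2} \le \frac{\|F(x)\|}{\|F(x_0)\|} \le 2Q\left[\frac{f(x) - f(z)}{f(x_0) - f(z)}\right]^{1/2} \qquad (A)$$

Moreover, if

$$f(x) - f(z) \le \epsilon^2 (f(x_0) - f(z))/4Q^3 \quad \text{then} \quad \|F(x)\| \le \epsilon\,\|F(x_0)\| \qquad (B)$$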
PROOF By the strong convexity of $f$ and Taylor's theorem we get:

$$\frac{\mu}{2}\|x - z\|^2 \le f(x) - f(z) \le \frac{\lambda}{2}\|x - z\|^2 \qquad (a)$$

By the generalized mean value theorem and the convexity of $f$ we get:

$$\|F(x)\| \le \lambda\|x - z\| \qquad (b)$$

and

$$f(x) - f(z) \le \|F(x)\|\,\|x - z\| \qquad (c)$$

By (a) and (b) we have:

$$\frac{\|F(x)\|}{\|F(x_0)\|} \le \lambda\left[\frac{2(f(x) - f(z))}{\mu\,\|F(x_0)\|^2}\right]^{1/2} \qquad (d)$$

By (a) and (c) we get

$$\|F(x)\| \ge \frac{\mu}{2}\|x - z\| \qquad (e)$$

and

$$\|F(x)\| \ge \left(\frac{\mu}{2}\right)^{1/2}(f(x) - f(z))^{1/2} \qquad (f)$$

To prove (B) we find, using the hypothesis of (B) together with (d), that

$$\frac{\|F(x)\|}{\|F(x_0)\|} \le \lambda\left[\frac{2\epsilon^2(f(x_0) - f(z))}{4\mu Q^3\,\|F(x_0)\|^2}\right]^{1/2}$$

Using (e) we find that the right hand side is less than or equal to

$$\lambda\left[\frac{2\epsilon^2(f(x_0) - f(z))}{4\mu Q^3\,\mu^2\|x_0 - z\|^2/4}\right]^{1/2}$$

Now using (a) the above expression is less than or equal to $\epsilon$.

We now turn to the proof of (A). Using (f) and (d) we find that

$$\frac{\|F(x)\|}{\|F(x_0)\|} \ge \left(\frac{\mu}{2}\right)^{1/2}\frac{(f(x) - f(z))^{1/2}}{\|F(x_0)\|} \ge \left(\frac{f(x) - f(z)}{f(x_0) - f(z)}\right)^{1/2}\frac{\mu}{2\lambda}$$

This proves the left side of (A). The right hand inequality is proved similarly, using (d)
and (f).


LEMMA 2 The class $C_s(\mu,\lambda)$ is esslinear.


PROOF Let $f$ be a potential function corresponding to $F$. Every algorithm $B(F)$ is now
also an algorithm $A(f)$. Every function $F \in C_s(\mu,\lambda)$ is the gradient of some $f \in C(\mu,\lambda)$.
Hence for some $F$, $\|F(x_k)\|/\|F(x_0)\|$ converges more slowly than $(e^{-1/R})^{k/2}/2Q$. Now
take for the algorithm $B$ the gradient algorithm mentioned above. Again by LEMMA 1
every function $F$ will converge under $B$ with at least a linear rate.

Since $C_s(\mu,\lambda)$ is a subset of $C^1(\mu,\lambda)$, for some $F \in C^1(\mu,\lambda)$, $\|F(x_k)\|/\|F(x_0)\|$
converges more slowly than $(e^{-1/R})^{k/2}/2Q$. Now for $B(F)$ we take the algorithm GRAD2
below. This algorithm converges linearly. Whence we have

THEOREM 2. The class $C^1(\mu,\lambda)$ is sublinearly convergent.

We now restrict the class $C^1(\mu,\lambda)$ to enlarge the possibility of faster convergence. Let
$C^1(\mu,\lambda,L)$ denote the set of maps $F \in C^1(\mu,\lambda)$ for which $\|D(x) - D(y)\| \le L\,\|x - y\|$ for all $x, y \in
H$. The following well-known theorem is adjusted for our present setting.

THEOREM 3 (KANTOROVICH (1948)) Take $x_0 \in H$. Let $\beta(x_0) = \|D^{-1}(x_0)\|$ and
$\eta(x_0) = \|D^{-1}(x_0)F(x_0)\|$. Assume that $\|D(x) - D(y)\| \le \Lambda(x_0)\|x - y\|$ for all pairs $x, y$
in the ball $B(x_0) = \{x \in H : \|x - x_0\| \le 2\eta(x_0)\}$. If $\eta(x_0)\beta(x_0)\Lambda(x_0) = h(x_0) \le 1/2$,
then $F$ has a root $z$ such that $z$ is in the ball $B(x_0)$, the Newtonian iterates $x_j$ defined by
$x_{j+1} = x_j - D^{-1}(x_j)F(x_j)$ lie in $B(x_0)$, and $\|x_j - z\| \le 2^{1-j}(2h(x_0))^{2^j - 1}\,\eta(x_0)$.

A convenient terminology similar to Smale's is that under the above circumstances $x_0$ is
an "approximate root".

In what follows we shall take $h(x_0) = 1/4$.
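As a small numerical illustration of how fast the Kantorovich error bound shrinks, here is a minimal sketch. It assumes the bound in the usual form $\|x_j - z\| \le 2^{1-j}(2h)^{2^j - 1}\eta$ with $h \le 1/2$; the helper name is hypothetical.

```python
def kantorovich_error_bound(j, h, eta):
    """Assumed Kantorovich bound on ||x_j - z||: 2^(1-j) * (2h)^(2^j - 1) * eta."""
    if not 0.0 < h <= 0.5:
        raise ValueError("the theorem requires 0 < h <= 1/2")
    return 2.0 ** (1 - j) * (2.0 * h) ** (2 ** j - 1) * eta


if __name__ == "__main__":
    # With h = 1/4 the bound starts at 2*eta (the radius of the ball B(x_0))
    # and then roughly squares at every Newton step.
    for j in range(5):
        print(j, kantorovich_error_bound(j, 0.25, 1.0))
```

With $h = 1/4$ the factor $2h = 1/2$ is raised to the power $2^j - 1$, which is the doubly exponential decay used later when Newton steps are counted.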

REMARK 3. The condition for an approximate root, $\eta(x_0)\beta(x_0)\Lambda(x_0) \le 1/4$, is
equivalent to:

$$\|F(x_0)\| \le \alpha(x_0) = \frac{1}{4\,\beta(x_0)\,\Lambda(x_0)\,\bigl\|D^{-1}(x_0)F(x_0)/\|F(x_0)\|\bigr\|}$$


REMARK 4. We have for all $x \in H$ global estimates for $\eta(x)$, $\beta(x)$, and $\Lambda(x)$. Namely
$\beta(x) \le 1/\mu$, $\eta(x) \le \|F(x)\|/\mu$, and $\Lambda(x) \le L$. From these estimates we get:

$$\alpha(x) \ge \mu^2/4L = \alpha$$

If $\|F(x)\| \le \alpha$ then $x$ is an approximate root, and
$\eta(x) \le \mu/4L$.



In many problems $\alpha$ is so small that the desired accuracy tolerance is achieved before an
approximate root is achieved. Thus the efficiency of a global Newton method in reaching
an approximate root is a crucial question. We now turn to our version of Smale's algorithm,
which we shall denote by "SGN". In what follows $\alpha(x_i)$ will be denoted by $\alpha_i$.

REMARK 5. In what follows the constants $\mu$, $\lambda$, and $L$ need not be finite over the entire
space $H$, but rather on the set $S = \{x \in H : \|F(x)\| \le \|F(x_0)\|\}$.

LEMMA 3. Assume $F \in C^1(\mu,\lambda)$ and $x_0$ is arbitrarily given in $H$. If $\|F(x_0)\| \le \alpha_0$ then
$x_0$ is an approximate root. If not, we define a sequence $x_0, t_1, x_1, t_2, \ldots$ inductively as
follows. Given $x_i$, set

$$t_{i+1} = \frac{\|F(x_i)\| - \alpha_i}{\|F(x_i)\|} \qquad (a)$$

Choose $x_{i+1}$ to satisfy

$$\|F(x_{i+1}) - t_{i+1}F(x_i)\| \le \alpha_i/2 \qquad (b)$$

Then

$$\|F(x_{i+1})\| \le \|F(x_0)\| - (i + 1)\alpha/2$$

where $\alpha$ is defined as in Remark 4.

PROOF. We show first that $x_{i+1}$ can be chosen to satisfy (b). Let $G_i(x) = F(x) -
t_{i+1}F(x_i)$. Since $\|G_i(x_i)\| = \alpha_i$, $x_i$ is an approximate root for $G_i$, because $F' = G_i'$. A few
Newton steps (we count them below) suffice to obtain $x_{i+1}$ such that $\|G_i(x_{i+1})\| \le \alpha_i/2$.
Thus (b) can be satisfied. Using the triangle inequality on (a) together with (b) we
get that $\|F(x_{i+1})\| \le \|F(x_i)\| - \alpha_i/2$. Whence $\|F(x_{i+1})\| \le \|F(x_0)\| - \frac12\sum_{j=0}^{i}\alpha_j \le
\|F(x_0)\| - (i + 1)\alpha/2$. Now choose $i$ so that $\|F(x_{i+1})\| \le \alpha$.

CLAIM 1 Let $N$ be the least integer exceeding $2(\|F(x_0)\| - \alpha)/\alpha$. Then for some $i \le N$, $x_i$
is an approximate root of $F$.
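The recursion of Lemma 3 can be sketched in a few lines. This is a minimal illustration, not the author's implementation: it assumes a scalar test map $F(x) = 2x + \sin x$ (so that $\mu = 1$ and $L = 1$ are valid bounds), and it uses the global constant $\alpha = \mu^2/4L$ in place of every $\alpha_i$, which is permitted because $\alpha \le \alpha_i$. All names are hypothetical.

```python
import math

# Assumed test problem: F'(x) = 2 + cos(x) lies in [1, 3] and |F''(x)| <= 1.
MU, LIP = 1.0, 1.0
ALPHA = MU ** 2 / (4.0 * LIP)  # global threshold alpha of Remark 4


def F(x):
    return 2.0 * x + math.sin(x)


def dF(x):
    return 2.0 + math.cos(x)


def sgn(x0, max_newton=50):
    """Drive |F| down in stages until x is an approximate root (|F(x)| <= ALPHA)."""
    x, stages = x0, 0
    while abs(F(x)) > ALPHA:
        r = abs(F(x))
        t = (r - ALPHA) / r          # step (a), with alpha_i replaced by ALPHA
        target = t * F(x)            # G(y) = F(y) - target has |G(x)| = ALPHA
        y = x
        for _ in range(max_newton):  # Newton steps on G until (b) holds
            g = F(y) - target
            if abs(g) <= ALPHA / 2.0:
                break
            y -= g / dF(y)
        else:
            raise RuntimeError("inner Newton iteration did not converge")
        x = y
        stages += 1
    return x, stages


if __name__ == "__main__":
    root, stages = sgn(5.0)
    print(root, stages, abs(F(root)))
```

Each stage lowers $\|F\|$ by at least $\alpha/2$, so the number of stages stays below the integer $N$ of Claim 1; in practice the inner Newton steps overshoot the tolerance and fewer stages are needed.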

We now estimate the number of Newton steps to move from $x_i$ to $x_{i+1}$.

LEMMA 4 Let $\{y_{ij}\}$ be a sequence of Newtonian iterates starting at $y_{i0} = x_i$. Let $G_i(z_i) =
0$. Then we can choose $x_{i+1} = y_{iK}$ where $K$ is the least integer $\ge 1.443\ln(1.443\ln 8Q)$.

PROOF We have seen that $x_i$ is an approximate zero for $G_i$, hence $\|y_{ij} - z_i\| \le \frac{\mu}{L}\left(\frac12\right)^{2^j}$.
Then

$$\|G_i(y_{ij}) - G_i(z_i)\| \le \lambda\,\|y_{ij} - z_i\| \le \lambda\mu L^{-1}\left(\tfrac12\right)^{2^j}.$$

Now choose $K$ so that $\lambda\mu L^{-1}\left(\frac12\right)^{2^K} \le \alpha/2$, that is, $\left(\frac12\right)^{2^K} \le 1/(8Q)$.
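The constant 1.443 is $1/\ln 2$, so the count in Lemma 4 is simply $\log_2\log_2 8Q$ rounded up. A minimal sketch, with a hypothetical function name, that computes $K$ and lets one check the defining inequality $(1/2)^{2^K} \le 1/(8Q)$:

```python
import math


def newton_steps_per_stage(Q):
    """Least integer K with (1/2)**(2**K) <= 1/(8*Q), i.e. K >= log2(log2(8*Q))."""
    if Q < 1.0:
        raise ValueError("the condition number Q = lambda/mu is at least 1")
    return math.ceil(math.log2(math.log2(8.0 * Q)))


if __name__ == "__main__":
    for Q in (1.0, 10.0, 1000.0, 1e6):
        print(Q, newton_steps_per_stage(Q))
```

Because of the double logarithm, $K$ grows extremely slowly with $Q$, which is the source of the insensitivity of SGN to the condition number.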
REMARK 6. The above algorithm can be optimized by changing the right hand side of
inequality (b) in the recursion above to $\alpha/q$ with $q > 1$. The formula for $N$ becomes
$(\|F(x_0)\| - \alpha)\,q/\alpha$ and $K$ becomes $1.443\ln(1.443\ln 4qQ)$. Now choose $q$ to minimize $NK$.

LEMMA 5 Take $F \in C^1(\mu,\lambda,L)$. Assume that $D(x)$ is self-adjoint for all $x \in H$. The
gradient method mentioned after REMARK 2, there called Algorithm $A$, which we
shall now call "GRAD1", will, starting at $x_0$, generate an approximate root in $K$ steps,
where $K$ is the smallest integer $\ge Q\ln[\|F(x_0)\|\,4QL/\mu^2]$.

PROOF The mapping $G$ defined by $G(y) = y - F(y)/\lambda$ has a fixed point $z$ satisfying $F(z)
= 0$. It is a contractor satisfying a Lipschitz condition with constant $q = 1 - 1/Q$; see Goldstein (1967, pp.
15 and 24). Set $G(x_n) = x_{n+1} = x_n - F(x_n)/\lambda$. Then $\|F(x_n) - F(z)\| \le \lambda\|x_n - z\| \le
\lambda q^n\|x_0 - x_1\|/(1 - q) = \|F(x_0)\|\,q^n/(1 - q)$. Now choose $n$ so that

$$\|F(x_0)\|\,q^n \le \alpha(1 - q)$$

Using the inequality $-\ln(1 - \mu/\lambda) \ge \mu/\lambda$, one obtains the lemma.
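The iteration in the proof is easy to try on a small example. The sketch below is illustrative only: it assumes the test map $F(x) = (x_1 + \tanh x_1,\ 4x_2)$, the gradient of $\frac12(x_1^2 + 4x_2^2) + \ln\cosh x_1$, for which $\mu = 1$, $\lambda = 4$ and $L = 1$ are valid bounds. The names are hypothetical.

```python
import math

# Assumed constants for the test map below: Jacobian diag(1 + sech^2(x1), 4).
MU, LAM, LIP = 1.0, 4.0, 1.0
Q = LAM / MU
ALPHA = MU ** 2 / (4.0 * LIP)


def F(x):
    return (x[0] + math.tanh(x[0]), 4.0 * x[1])


def norm(v):
    return math.hypot(v[0], v[1])


def grad1(x0, max_steps=10000):
    """Iterate x <- x - F(x)/lambda until ||F(x)|| <= ALPHA (an approximate root)."""
    x, steps = x0, 0
    while norm(F(x)) > ALPHA:
        if steps >= max_steps:
            raise RuntimeError("step limit reached")
        fx = F(x)
        x = (x[0] - fx[0] / LAM, x[1] - fx[1] / LAM)
        steps += 1
    return x, steps


def grad1_bound(x0):
    """Step bound of Lemma 5: smallest integer >= Q ln(4 Q L ||F(x0)|| / mu^2)."""
    return math.ceil(Q * math.log(4.0 * Q * LIP * norm(F(x0)) / MU ** 2))


if __name__ == "__main__":
    x0 = (3.0, -2.0)
    x, steps = grad1(x0)
    print(steps, grad1_bound(x0), norm(F(x)))
```

On this example the observed step count is well below the bound, as one would expect from a worst-case estimate.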

We  now  consider  a  gradient  method  for  the  non-symmetric  case.  We  call  this  gradient 
method  GRAD2. 

ALGORITHM GRAD2 Take $F \in C^1(\mu,\lambda,L)$, $x_0 \in S$ and set $f(x) = \|F(x)\|^2$. Then $\nabla f(x)$
satisfies $\|\nabla f(x) - \nabla f(y)\| \le M\|x - y\|$ for all $x$ and $y$ in $S$, with $M = 2(\lambda^2 + \|F(x_0)\|L)$.
Set $\phi(x) = f(x)\nabla f(x)/\|\nabla f(x)\|^2$. Given arbitrary $x_0$ in $H$ set $x_{k+1} = x_k - \gamma_0\phi(x_k)$ with
$\gamma_0 = 2\mu^2/M$. If $k$ exceeds

then $x_k$ is an approximate root.

PROOF Adding and subtracting $(F'(y))^*F(x)$ we find that $\|\nabla f(x) - \nabla f(y)\| \le 2(\lambda^2 +
\|F(x)\|L)\,\|x - y\|$, and $f(x) - f(x - \gamma\phi(x)) = \gamma[\nabla f(x), \phi(x)] + \gamma[\nabla f(\xi) - \nabla f(x), \phi(x)]$.
Here $\|\xi - x\| \le \gamma\|\phi(x)\|$, so that $\|\nabla f(\xi) - \nabla f(x)\| \le \gamma M\|\phi(x)\|$. Then $f(x - \gamma\phi(x)) \le f(x) - \gamma f(x) + \gamma^2 M\|\phi(x)\|^2$, $\nabla f(x) =
2(F'(x))^*F(x)$, and $\|\nabla f(x)\| \ge 2\|F(x)\|\mu$. Then $f(x_{k+1}) \le f(x_k)[1 - \gamma_0 + \gamma_0^2 M/4\mu^2] =
f(x_k)(1 - \mu^2/M)$. Taking square roots we get that $\|F(x_{k+1})\| \le \|F(x_k)\|(1 - \mu^2/2M)$.
Finally, choose $k$ so that $\|F(x_0)\|(1 - \mu^2/2M)^k \le \mu^2/4L$.
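A minimal sketch of the GRAD2 iteration on a small non-symmetric example follows. Everything specific here is an assumption made for illustration: the test map $F(x) = (2x_1 - x_2,\ x_1 + 2x_2 + \tfrac12\sin x_2)$, the bounds $\mu = 1.5$, $\lambda = 3$, $L = 0.5$ (which hold for its Jacobian), and the step $\gamma_0 = 2\mu^2/M$, the value for which $1 - \gamma_0 + \gamma_0^2 M/4\mu^2 = 1 - \mu^2/M$.

```python
import math

# Assumed bounds for the Jacobian [[2, -1], [1, 2 + 0.5 cos(x2)]] of the map below.
MU, LAM, LIP = 1.5, 3.0, 0.5


def F(x):
    return (2.0 * x[0] - x[1], x[0] + 2.0 * x[1] + 0.5 * math.sin(x[1]))


def jacobian(x):
    return ((2.0, -1.0), (1.0, 2.0 + 0.5 * math.cos(x[1])))


def norm(v):
    return math.hypot(v[0], v[1])


def grad2(x0, max_steps=10000):
    """Iterate x <- x - gamma0 * phi(x), phi = f grad f / ||grad f||^2, f = ||F||^2."""
    M = 2.0 * (LAM ** 2 + norm(F(x0)) * LIP)
    gamma0 = 2.0 * MU ** 2 / M       # assumed step length, see the lead-in
    alpha = MU ** 2 / (4.0 * LIP)    # approximate-root threshold of Remark 4
    x = x0
    residuals = [norm(F(x))]
    while residuals[-1] > alpha:
        if len(residuals) > max_steps:
            raise RuntimeError("step limit reached")
        fx, J = F(x), jacobian(x)
        f = fx[0] ** 2 + fx[1] ** 2
        # grad f = 2 J^T F
        g = (2.0 * (J[0][0] * fx[0] + J[1][0] * fx[1]),
             2.0 * (J[0][1] * fx[0] + J[1][1] * fx[1]))
        scale = gamma0 * f / (g[0] ** 2 + g[1] ** 2)
        x = (x[0] - scale * g[0], x[1] - scale * g[1])
        residuals.append(norm(F(x)))
    return x, residuals


if __name__ == "__main__":
    x, residuals = grad2((2.0, 1.0))
    print(len(residuals) - 1, residuals[-1])
```

Each step costs one Jacobian-transpose times vector product and no linear solve, which is the point of comparison with the Newton-based SGN.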

Comparing the algorithms. Let

$$S = \frac{\|F(x_0)\|\,L}{\mu^2}$$

By Claim 1 and Lemma 4 the total number of steps of SGN, followed by the counts for the two gradient methods, is:

SGN : $11.544\,(S - .25)\,\ln(1.443\ln 8Q)$

GRAD2 : $2(S + Q^2)\ln 4S$

GRAD1 : $Q\ln(4SQ)$

Notice that unlike GRAD1 and GRAD2, SGN is insensitive to the condition number $Q$!
However, SGN is sensitive to $S$. GRAD1 is sensitive to $Q$ but not to $S$. GRAD2 is sensitive
to both of these factors. In the symmetric case, for fixed $Q$, GRAD1 is quicker than SGN
when $\|F(x_0)\|$ grows sufficiently large or if $L/\mu^2$ gets sufficiently large. On the other
hand, for fixed $S$, SGN is quicker as $Q$ grows sufficiently large. In the non-symmetric case
SGN is superior to GRAD2 with respect to the number of steps. When the cost per
step is included, the gradient methods become cheaper in the $n$-dimensional case when $n$
is sufficiently large. For each Newton step an $n \times n$ system of linear equations is solved,
costing $O(n^3)$ multiplications, while the corresponding GRAD2 step involves a
multiplication of an $n \times n$ and an $n \times 1$ matrix, or $n^2$ multiplications, and GRAD1 requires no
matrix operations.
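The three step-count formulas can be tabulated directly. The sketch below simply evaluates the expressions quoted in the comparison for given $S$ and $Q$; the function names are illustrative, and the counts are left unrounded.

```python
import math


def sgn_steps(S, Q):
    """SGN estimate: 11.544 (S - 0.25) ln(1.443 ln 8Q)."""
    return 11.544 * (S - 0.25) * math.log(1.443 * math.log(8.0 * Q))


def grad2_steps(S, Q):
    """GRAD2 estimate: 2 (S + Q^2) ln 4S."""
    return 2.0 * (S + Q ** 2) * math.log(4.0 * S)


def grad1_steps(S, Q):
    """GRAD1 estimate: Q ln(4 S Q)."""
    return Q * math.log(4.0 * S * Q)


if __name__ == "__main__":
    for S, Q in ((10.0, 1e4), (1e6, 2.0)):
        print(S, Q, sgn_steps(S, Q), grad2_steps(S, Q), grad1_steps(S, Q))
```

For moderate $S$ and large $Q$ the SGN estimate is the smallest, while for large $S$ and small $Q$ the GRAD1 estimate is far smaller, in line with the discussion above.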